Tuning Continual Exploration in Reinforcement Learning

Authors

  • Youssef Achbany
  • Francois Fouss
  • Luh Yen
  • Alain Pirotte
  • Marco Saerens

Abstract

This paper presents a model for tuning continual exploration optimally by integrating exploration and exploitation in a common framework. It first quantifies exploration by defining the degree of exploration of a state as the entropy of the probability distribution for choosing an admissible action. Then, the exploration/exploitation tradeoff is formulated as a global optimization problem: find the exploration strategy that minimizes the expected cumulated cost, while maintaining fixed degrees of exploration at the states. In other words, exploitation is maximized for constant exploration. This formulation leads to a set of nonlinear iterative equations reminiscent of the value-iteration algorithm. Their convergence to a local minimum can be proved for a stationary environment. Interestingly, in the deterministic case, when there is no exploration, these equations reduce to the Bellman equations for finding the shortest path. If the graph of states is directed and acyclic, the nonlinear equations can easily be solved by a single backward pass from the destination state. Stochastic shortest-path problems and discounted problems are also examined, and they are compared to the SARSA algorithm. The theoretical results are confirmed by simple simulations showing that the proposed exploration strategy outperforms the ε-greedy and the naive Boltzmann strategies.

A preliminary version of this work appeared in the proceedings of the International Conference on Artificial Neural Networks (ICANN 2006). Marco Saerens is also a Research Fellow of the IRIDIA Laboratory, Université Libre de Bruxelles.

Introduction

An issue central to reinforcement learning is the tradeoff between exploration and exploitation. Exploration aims to try new ways of solving the problem, while exploitation aims to capitalize on already well-established solutions. Exploration is especially relevant when the environment is changing.
In a changing environment, good solutions can deteriorate and better solutions can appear over time. Without exploration, the system sends agents only along the currently best paths without exploring alternative paths. The system therefore remains unaware of the changes and its performance inevitably deteriorates with time. A key feature of reinforcement learning is that it explicitly addresses the exploration/exploitation issue as well as the online estimation of the associated probability distributions in an integrated way [24].

Preliminary or initial exploration must be distinguished from continual online exploration. The objective of preliminary exploration is to discover relevant goals, or destination states, and to estimate a first, possibly optimal, policy for reaching them using search methods developed in artificial intelligence [13]. Continual online exploration aims to continually explore the environment, after the preliminary exploration stage, in order to adjust the policy to changes in the environment.

Preliminary exploration can be conducted in two ways [26, 27, 28, 29]. A first group of strategies, often referred to as undirected exploration, explores at random: control actions are selected according to a probability distribution that takes the expected cost into account. The second group, referred to as directed exploration, uses domain-specific knowledge to guide exploration. Directed exploration usually provides better results in terms of learning time and cost.

Continual online exploration can be performed by re-exploring the environment periodically or continually [6, 21] with an ε-greedy or a Boltzmann exploration strategy. For instance, a joint estimation of the exploration strategy and the state-transition probabilities for continual online exploration can be performed within the SARSA framework [19, 22, 24].
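
The two undirected strategies named above can be sketched in a few lines. The following is only an illustration under stated assumptions, not the authors' implementation: the function names, the list `costs` of expected costs per action, and the parameters `epsilon` and `temperature` are hypothetical, and the cost-minimizing convention (lower cost is better) follows the abstract.

```python
import math
import random


def epsilon_greedy_probs(costs, epsilon):
    """Action probabilities for epsilon-greedy selection over costs.

    With probability epsilon an action is drawn uniformly at random;
    otherwise the cheapest action is chosen.
    """
    n = len(costs)
    best = min(range(n), key=lambda a: costs[a])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs


def boltzmann_probs(costs, temperature):
    """Boltzmann (softmax) probabilities: lower cost gives higher probability."""
    lowest = min(costs)  # shift so every exponent is <= 0 (avoids overflow)
    weights = [math.exp(-(c - lowest) / temperature) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(probs, rng=random):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

In this sketch, `epsilon` and `temperature` are fixed by hand. A large temperature makes the Boltzmann distribution close to uniform, while a small one concentrates it on the cheapest action.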
Yet another attempt to integrate exploration and exploitation, this time in a temporal-difference algorithm, is presented in [?], where the authors prove the existence of at least one fixed point.

This paper presents a unified framework integrating exploitation and exploration for undirected continual exploration, without addressing preliminary exploration in detail. Exploration is defined as the association of a probability distribution to the set of available control actions in each state (choice randomization). The degree of exploration in any given state is quantified as the (Shannon) entropy [9, 14] of this probability distribution on the set of available actions. If no exploration is performed, the agents are routed on the best path with probability one: they just exploit the solution. With exploration, the agents continually explore a possibly changing environment to keep current with it. When the entropy is zero in a state, no exploration is performed from that state; when the entropy is maximal, a full, blind exploration is performed, with equal probability of choosing any action.

The online exploration/exploitation issue is then stated as a global optimization problem: learn the exploration strategy that minimizes the expected cumulated cost from the initial state to a goal while maintaining a fixed degree of exploration at each state. In other words, exploitation is maximized for constant exploration. This approach leads to a set of nonlinear equations defining the optimal solution. These equations can be solved by iterating them until convergence, which is proved for a stationary deterministic environment and a particular initialization strategy. Their solution provides the action policy (the probability distribution of choosing an action in each state) that minimizes the expected cost from the initial state to a destination...
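
To make the entropy-based degree of exploration concrete, here is a minimal sketch. It is an illustration only and does not reproduce the paper's nonlinear equations. The names `exploration_entropy`, `boltzmann_probs`, and `temperature_for_entropy` are hypothetical, natural logarithms are an assumption, and the bisection search on a Boltzmann temperature is a stand-in technique chosen here to show what holding the entropy of a single state at a prescribed value can look like.

```python
import math


def exploration_entropy(probs):
    """Shannon entropy (in nats) of an action distribution.

    Terms with zero probability contribute nothing.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def boltzmann_probs(costs, temperature):
    """Boltzmann probabilities over costs (lower cost, higher probability)."""
    lowest = min(costs)  # shift so every exponent is <= 0
    weights = [math.exp(-(c - lowest) / temperature) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]


def temperature_for_entropy(costs, target, lo=1e-3, hi=1e3, iters=100):
    """Bisection (on a log scale) for a temperature whose Boltzmann
    distribution has entropy close to `target`.

    Assumes the target lies between the entropies reached at `lo` and `hi`;
    the entropy of a Boltzmann distribution does not decrease as the
    temperature grows, which is what makes bisection applicable.
    """
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if exploration_entropy(boltzmann_probs(costs, mid)) < target:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)
```

With these definitions, a deterministic choice such as `[1.0, 0.0, 0.0]` has entropy zero (no exploration), and a uniform choice among four actions has entropy `log(4)`, about 1.386 nats (full, blind exploration). Calling `temperature_for_entropy(costs, target)` with a target between those extremes yields an intermediate degree of exploration for one state.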

Similar resources

Optimal Tuning of Continual Online Exploration in Reinforcement Learning

This paper presents a framework for tuning continual exploration optimally. It first quantifies the rate of exploration by defining the degree of exploration of a state as the probability-distribution entropy for choosing an admissible action. Then, the exploration/exploitation tradeoff is stated as a global optimization problem: find the exploration strategy that minimizes the ex...

Tuning continual exploration in reinforcement learning: An optimality property of the Boltzmann strategy

This paper presents a model for tuning continual exploration optimally by integrating exploration and exploitation in a common framework. It first quantifies exploration by defining the degree of exploration of a state as the entropy of the probability distribution for choosing an admissible action in that state. Then, the exploration/exploitation tradeoff is formulated as a globa...

A Monte Carlo-Based Search Strategy for Dimensionality Reduction in Performance Tuning Parameters

Redundant and irrelevant features in high dimensional data increase the complexity in underlying mathematical models. It is necessary to conduct pre-processing steps that search for the most relevant features in order to reduce the dimensionality of the data. This study made use of a meta-heuristic search approach which uses lightweight random simulations to balance between the exploitation of ...

Learning exploration strategies in model-based reinforcement learning

Reinforcement learning (RL) is a paradigm for learning sequential decision making tasks. However, typically the user must hand-tune exploration parameters for each different domain and/or algorithm that they are using. In this work, we present an algorithm called leo for learning these exploration strategies on-line. This algorithm makes use of bandit-type algorithms to adaptively select explor...

Reinforcement Learning Based PID Control of Wind Energy Conversion Systems

In this paper an adaptive PID controller for Wind Energy Conversion Systems (WECS) has been developed. The adaptation technique applied to this controller is based on Reinforcement Learning (RL) theory. Nonlinear characteristics of wind variations as plant input, wind turbine structure and generator operational behavior demand a high-quality adaptive controller to ensure both robust stability an...

Publication date: 2006